{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 17 - Cross tabulation\n", "\n", "Cross tabulation is a method to count how many data points fits in different categoriese. The output is a *contingency table* or *cross tab* which contains this information.\n", "\n", "This lab will use:\n", "- the insurance dataset from Labs 7,8, and 13 (download [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/insurance.csv))\n", "- the dataset of all film permits issued in New York city, which on NYC Open Data [here](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p). To download: \n", " - click \"View Data\" (blue button in upper right)\n", " - on the next page, click \"Export\" (in menu in upper right)\n", " - click \"CSV\" to download\n", " - if you have trouble, the dataset is [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/Film_Permits.csv)\n", " \n", "Import the necessary packages and tell Jupyter to display all plots and all dataframe columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "\n", "%matplotlib inline\n", "\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Insurance Data\n", "\n", "Read the insurance data into the dataframe `insurance` and check that it was read in correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can make a continguency table of the sex and smoker columns as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(insurance[\"sex\"], insurance[\"smoker\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many female smokers are there? How many male non-smokers?\n", "\n", "If we wanted to know how much data is in a particular category (say non-smokers), we could add up the values in that row or column. But Pandas can do this for us, by adding the parameter `margins = True`. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How could you compute the proportion of male smokers?\n", "\n", "We can also have Pandas compute the proportions instead of the number of people fitting each category. Add the parameter `normalize = True`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the proportion of male smokers in the data? What's the proportion of females in the data?\n", "\n", "We can include a third category in our continguency table as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(insurance[\"region\"],[insurance[\"sex\"],insurance[\"smoker\"]], margins = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many female smokers live in the northeast?\n", "\n", "## Film Permit data\n", "\n", "Now let's look at the film permit data.\n", "\n", "Load the CSV file into a dataframe called `Film` and check that it was created correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a contingency table with the boroughs as the rows and the category of permits (Commercial, Film, etc.) as the columns. Include the marginal totals using the `margins` parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many film permits were issued for documentaries in the Bronx? How many film permits were issued for music videos in Queens?\n", "\n", "Normalize the frequencies (counts) to get the proportions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What proportion of film permits were for television in Manhattan? What proportion of film permits were for theater in any borough?\n", "\n", "This continguency table is a little large. We can visualize the patterns in it using a *heatmap* (from the Seaborn package). First we save a simple contingency table as a variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "film_tab = pd.crosstab(film[\"Borough\"],film[\"Category\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we pass it into the `heatmap()` function in Seaborn as a parameter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(film_tab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that there is also a sub category column for the type of film, commercial, etc.\n", "\n", "Can you create a contingency table with the boroughs as the rows and the columns being the categories, broken into their subcategories? \n", "\n", "Hint: See the last example with the insurance data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }